ABSTRACT

This report describes the initial evaluation of a
text compression algorithm against computer assisted instruction
(CAI) material. A review of some concepts related to statistical text
compression is followed by a detailed description of a practical text
compression algorithm. A simulation of the algorithm was programmed
and used to obtain compression ratios for a small sample of both
traditional frame-structured CAI material and a new type of
information-structured CAI material. The resulting compression ratios
are near 1.5:1 for both types of material. The simulation program was
modified to apply the algorithm to the lesson files of a particular
frame-structured CAI subsystem used in the Air Force Phase II Base
Level System. The compression in this case was found to be 1.3:1
because some incompressible frame-formatting bytes were present in
the lesson file. The modified simulation program was also used to
take letter occurrence statistics on the text being compressed. From
these, a theoretical compression was calculated using a probabilistic
model of the compression algorithm. Theoretical compression was
within two percent of measured compression, thus verifying the
model's applicability. (Author)
FOREWORD

One of the goals of Air Force Electronic Systems Division
is the development of a technology for computer-based, personnel-
support systems integrated into Air Force Information Systems.
These support systems are required to improve the efficiency
of man-computer interactions in the host Information Systems.
They are designed to provide automated on-the-job training and
performance- and decision-aiding for Information Systems
personnel.

Task 280104, Computer-Aided Instruction Techniques, under
Project 2801, Design Methodology for Military Information
Systems, was established to develop tools and techniques for
computer-aided training and performance- and decision-aiding in
these systems. It is also concerned with new software
engineering techniques which will permit cost-effective implementation
of these aids. This study relates to the latter objective.

This report, one in a series supporting Project 2801,
addresses the problem of reducing the size of text files which
constitute the bulk of the lesson files in typical
computer-aided instruction (CAI) systems. The approach is to simulate
a practical text compression algorithm and test it against
CAI lesson material. While the orientation of this study is
toward CAI, the technique is generally applicable to reducing
the size of text files in other systems such as data management,
command and control, and intelligence data bases.

This study was performed by Captain J. M. Knight, Jr.
as part of his reserve training day duties between May 1970
and September 1971, including two 2-week active duty tours.
Dr. Sylvia R. Mayer, ESD/MCIT, suggested the study and served
as Air Force Task Scientist.

This Technical Report has been reviewed and is approved.

SYLVIA R. MAYER, Ph.D.      MELVIN B. EMMONS, Colonel, USAF
Project Scientist           Director, Information Systems Technology
                            Deputy for Command & Management Systems
ABSTRACT

This report describes the initial evaluation of a text compression
algorithm against Computer-Aided Instruction (CAI) material. A
review of some concepts related to statistical text compression is
followed by a detailed description of a practical text compression
algorithm. A simulation of the algorithm was programmed and used
to obtain compression ratios for a small sample of both traditional
frame-structured CAI material and a new type of information-structured
CAI material. The resulting compression ratios are near 1.5 to one
for both types of materials. The simulation program was modified to
apply the algorithm to the lesson files of a particular frame-structured
CAI subsystem used in the Air Force Phase II Base Level System. The
compression in this case was found to be 1.3 to one because of the
presence in the lesson file of incompressible frame-formatting bytes.
The modified simulation program was also used to take letter occurrence
statistics on the text being compressed. From these, a theoretical
compression was calculated using a probabilistic model of the
compression algorithm. Theoretical compression was within two percent
of measured compression, thus verifying the model's applicability.
The report closes by raising some questions and discussing
future work.
SECTION I

INTRODUCTION

Presently, lesson material for Computer-Aided Instruction
(CAI) occupies considerable disk space when the CAI system is
brought on-line. For example, in the Computer-Directed Training
(CODIT) subsystem of the Air Force Phase II Base Level System,
each 300 frame lesson is stated to occupy 121,600 bytes. Even
the short "Computer Operator's Course" contains the equivalent
of 14 lessons; other courses, such as the personnel course,
contain many more. Accordingly, the technological area of text
compression is being reviewed for practical methods whereby CAI
data bases may be reduced in size with only moderate computational
expense.

Section II presents an elementary discussion of statistical
text compression and some indication of its performance on English
text. However, there also exists a simpler compression algorithm
based on the practical fact that, although data characters are
stored in 8-bit bytes, only about one third of the potential
256 characters are actually used in current ADP systems; the
remaining two-thirds can be used to encode frequently
occurring character pairs into single, otherwise unused characters,
thus obtaining data compression.

This report describes in more detail a simple, practical
compression algorithm, its application to a small set of CAI
data base material, and the results. Performance of the algorithm
is modeled and the model is experimentally verified. In addition,
a short discussion in Section VI provides guidance for future work.






SECTION II 



CONCEPTS IN STATISTICAL TEXT DATA COMPRESSION 

A. Bits

Data is stored in bits or in groups of bits, called bytes.
One "bit" of information represents the outcome of a single yes or
no decision. One bit can also represent a binary state of a given
situation. An ordinary room light switch can store one bit of
information, e.g., "on" might mean "at home" and "off" might mean
"not at home."

Groups of bits can represent more information. Two switches
can represent two sequential binary decisions, i.e., four outcomes,
or situation states, such as "on, on"; "on, off"; "off, on"; and
"off, off". Three switches can represent eight states and, in general,
N switches can represent $2^N$ states. A byte consisting of eight bits
can represent 256 characters such as A, B, C, ... 1, 2, 3, ?, $,
etc. Data is generally stored one character to a byte. Nine channel
magnetic data-processing tape can store 800 bytes per lineal inch of
tape because the eight bits of the byte are laterally distributed across
the tape, along with a ninth bit, called a parity check bit.

B. Entropy

Entropy is a property of the units, such as characters or symbols,
which make up data. Entropy is a measure of the "surprisal", or information
value, of a symbol. It has the units of bits/symbol and a common
designation of H. A few simple examples will perhaps clarify the intuitive
notion of entropy.

For instance, if it is equally likely that John is going to the
seashore or the mountains this summer, and we hear that he is going to
the mountains, we are moderately informed, or shall we say, surprised.
In this situation, the symbols "mountain" and "seashore" have for us
equal information value. They are said to have equal entropy. If, on
the other hand, John historically goes to the mountains nineteen summers
out of twenty and we hear he is going to the mountains, we are not terribly
surprised or informed. The symbol "mountain", in this case, possesses
a low entropy, information value, or surprisal content. If we hear that
John is going to the seashore we are quite surprised and highly informed
of the happening of a low probability event. The symbol "seashore" has
a high entropy, information value, or surprisal content. The entropy
of a symbol is related to the a priori probability of occurrence of that
symbol.



The mathematical measure of entropy of the ith
symbol in a data set is given by

    $H_i = -\log_2 p_i$  (bits/symbol)                    (1)

where $p_i$ is the a priori probability of occurrence of the ith
symbol in a data set. A symbol occurring 1/2 of the time ($p_i = 0.5$)
has an entropy of one bit/symbol. One occurring 1/4 of the time
($p_i = 0.25$) has an entropy of two bits/symbol. One occurring 1/8 of the
time has three bits/symbol and, in general, one occurring $1/2^k$ of the
time has k bits/symbol entropy. Also, k may be fractional as well as
integer, depending on $p_i$.
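
Equation (1) can be checked numerically. The following is a minimal sketch; the function name `symbol_entropy` is introduced here for illustration only and does not come from the report.

```python
import math

def symbol_entropy(p):
    """Entropy (surprisal) in bits/symbol of a symbol whose a priori
    probability of occurrence is p, per equation (1)."""
    return -math.log2(p)

# Probabilities 1/2, 1/4 and 1/8 give whole numbers of bits;
# other probabilities, such as 0.3, give fractional values.
for p in (0.5, 0.25, 0.125, 0.3):
    print(f"p = {p}: {symbol_entropy(p):.3f} bits/symbol")
```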

C. String Data

Much data is transmitted and stored in the form of
strings, i.e., connected sets of alphanumeric characters, or
other symbols, issuing from an information "source" and bound
ultimately for an information "sink". Consider a source capable
of generating four characters: A, B, C, and D, each occurring
1/4 of the time, i.e., $p_A = p_B = p_C = p_D = 0.25$. The entropy
of all characters is the same, two bits/symbol, and therefore the
average entropy of the source is also two bits/symbol. Each character A, B,
C, and D may be represented in transmission by two on-off
(binary) pulses and in storage by two binarily magnetized
patches on a computer tape or disk unit. But now consider a
source which exhibits an unequal distribution of A, B, C, and
D symbols, e.g., $p_A = 0.4$, $p_B = 0.3$, $p_C = 0.2$ and $p_D = 0.1$.
Using equation (1) the entropies are calculated as $H_A = 1.32$,
$H_B = 1.74$, $H_C = 2.32$ and $H_D = 3.32$, all in bits/symbol. The
average (or expected value of) the source entropy $H_S$ is given
by

    $H_S = 0.4 H_A + 0.3 H_B + 0.2 H_C + 0.1 H_D$         (2)

The value of $H_S$ is 1.846 bits/symbol. Note that this value is
less than the 2 bits/symbol average source entropy of the
"equally likely" source. A still more uneven occurrence
distribution than that given above would result in a smaller source
entropy.

Although it is not obvious, the above source entropy
value does lead us to suspect that we can find a code, i.e.,
a mapping, between A, B, C, D and four groups of one or more
bits each such that the average number of bits per code group
is not only close to the source entropy but also is less than
a straight two bits per character. This is indeed the case and
the code is as follows:

    A = 1     (one bit/symbol)
    B = 01    (two bits/symbol)
    C = 001   (three bits/symbol)
    D = 000   (three bits/symbol)

The coded source sequence

    11011001011000...

is uniquely decodable as the original source sequence

    AABACBAD...

Considering the probability of occurrence of A, B, C,
and D we obtain an average code length of 1.9 bits/symbol.

This represents a slight compression of 1.052 relative
to the original two bits/symbol. A more unequal character
occurrence distribution would result in a higher compression
ratio. Thus, we see that data compression can involve
measuring the statistics of source symbol occurrence, designing an
efficient code, and designing both an encoder and a decoder,
implementable in either hardware or software.
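
The four-symbol example above can be reproduced with a short sketch. The probabilities and code words are those given in the text; the function names are illustrative assumptions, and the decoder relies on the code being prefix-free.

```python
import math

PROBS = {"A": 0.4, "B": 0.3, "C": 0.2, "D": 0.1}
CODE = {"A": "1", "B": "01", "C": "001", "D": "000"}

def source_entropy(probs):
    """Average source entropy in bits/symbol, as in equation (2)."""
    return sum(-p * math.log2(p) for p in probs.values())

def average_length(probs, code):
    """Expected code word length in bits/symbol."""
    return sum(p * len(code[s]) for s, p in probs.items())

def encode(message, code):
    return "".join(code[s] for s in message)

def decode(bits, code):
    """Decode by accumulating bits until they match a code word."""
    inverse = {word: sym for sym, word in code.items()}
    out, word = [], ""
    for b in bits:
        word += b
        if word in inverse:
            out.append(inverse[word])
            word = ""
    return "".join(out)

print(round(source_entropy(PROBS), 3))      # source entropy, bits/symbol
print(round(average_length(PROBS, CODE), 3))  # average code length
print(encode("AABACBAD", CODE))
```

The ratio of two bits/symbol to the average code length gives the slight compression quoted in the text.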

D. Block Source Encoding

Another form of encoding is to group symbols of a
source string into blocks. Consider an example where the
string consists of specifications of right- or left-handedness
and that, for our sample, right-handers outnumber left-handers
by 19 to one. The probability of a right-hander, $p_R$, is 0.95
and the probability of a left-hander, $p_L$, is 0.05. Simple symbol
encoding of "0" for R and "1" for L yields an average code word
length of one bit/symbol. But the entropy of a binary source
with .95 and .05 probabilities is only 0.286 bits/symbol. This
suggests that we can do better than merely encode R and L into
"0" and "1". However, it also suggests that the very best we
can do is to obtain a compression of about 3.5.

Now consider coding blocks of, let us say, three symbols.
We now get a new data source by grouping the old data source
into blocks of three. The new data source emits eight different
symbols 1, 2, ..., 8, each representing a possible combination
of three of the symbols from the old data source. The probabilities
of symbol occurrence for the new data source are derivable from
the probabilities of symbol occurrence from the old data source.
Assuming the occurrence of any symbol is independent of the
occurrence of the previous symbol we obtain, for example, the
probability of symbol 3 ("RLR" = "010") from the product of
probabilities

    $p_R \cdot p_L \cdot p_R = (0.95)(0.05)(0.95) = 0.04513$.

The results are summarized in Table 1.

Table 1

TABLE OF BLOCK ENCODING ENTROPIES

Symbols from      Symbols from      Probability of       H of new
old data source   new data source   new source symbol    source symbol

000               1                 .85738                 .22199
001               2                 .04513                4.46977
010               3                 .04513                4.46977
011               4                 .00237                8.72090
100               5                 .04513                4.46977
101               6                 .00237                8.72090
110               7                 .00237                8.72090
111               8                 .00012               13.02468

H = 0.85906 for the new source.
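
The block probabilities and entropies of Table 1 can be regenerated with the sketch below, which assumes independent symbols as the text does. The names `P`, `block_table` and `H_new` are illustrative. The last digits may differ slightly from the printed table, which appears to take logarithms of already rounded probabilities.

```python
import math
from itertools import product

# "0" stands for a right-hander (R), "1" for a left-hander (L).
P = {"0": 0.95, "1": 0.05}

def block_table(block_len=3):
    """List (block, probability, entropy in bits) for every block of
    block_len old-source symbols, assuming independent symbols."""
    rows = []
    for block in product("01", repeat=block_len):
        prob = math.prod(P[b] for b in block)
        rows.append(("".join(block), prob, -math.log2(prob)))
    return rows

rows = block_table()
H_new = sum(prob * h for _, prob, h in rows)  # entropy per block of three

for block, prob, h in rows:
    print(f"{block}  {prob:.5f}  {h:.5f}")
print(f"H = {H_new:.4f} bits/new symbol, {H_new / 3:.4f} bits/old symbol")
```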

The entropy of the new source is 0.85906 bits/new source
symbol. Notice that, on the basis of three old source symbols
to one new source symbol, the entropy is also .2863 bits/old
source symbol. However, now we have, with eight symbols instead
of only two, more freedom to design an efficient code. There
exists a technique which allows the construction of a code whose
coded entropy is within one bit/symbol of the entropy of the
original source. For a block length of one the code is simply
one bit in length for each source symbol "0" and "1"; hence,
the coded source entropy is one bit/symbol, which is within one
bit/symbol of the source entropy, which must lie between zero
and 1.00. Table 2 gives the efficient code.
TABLE 2

Table of Efficient Code for Block Length = 3

Source                 Length    Probability of    Expected value of
Symbol    Code         (bits)    Occurrence        Symbol Length

1         1            1         .85738            .85738
2         00           2         .04513            .09026
3         010          3         .04513            .13539
4         0111         4         .04513            .18052
5         01100        5         .00237            .01185
6         011010       6         .00237            .01422
7         0110111      7         .00237            .01659
8         0110110      7         .00012            .00096
                                                  1.10717

Average symbol length = 1.10717 bits/new source symbol.



From the above table we see that the average code word
length is now 1.10717 bits/new symbol and this quantity represents
three old symbols, such as RLR. This code yields a compression
of 2.71 to one compared with a maximum possible compression of
3.5 to one. The use of longer blocks, and more complex codes,
will result in a closer approach to the maximum possible
compression figure. In this example we have assumed independence
of symbol occurrence. Should there be any symbol occurrence
dependence, resulting in lower entropy, block encoding will pick
up this advantage also. Thus, we see that data compression not
only involves measuring original source occurrence probabilities
and devising efficient codes but also blocking the original
source sequence into reasonable lengths, treating these as a
new source, and then devising an efficient code based on the
probabilities of the new source.
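
As a cross-check on Table 2, the sketch below confirms that the listed code words form a prefix-free (uniquely decodable) code and recomputes the expected code word length from the code words and probabilities as transcribed here. The helper names are illustrative. Recomputed this way, the column sums to roughly 1.31 bits per block of three (about 2.3 to one) rather than the 1.10717 and 2.71 quoted, so the printed total may contain an arithmetic or transcription slip and both figures are best treated with caution.

```python
# (code word, probability) pairs as printed in Table 2
TABLE_2 = [
    ("1", 0.85738), ("00", 0.04513), ("010", 0.04513), ("0111", 0.04513),
    ("01100", 0.00237), ("011010", 0.00237), ("0110111", 0.00237),
    ("0110110", 0.00012),
]

def is_prefix_free(words):
    """True if no code word is a prefix of a different code word."""
    return not any(a != b and b.startswith(a) for a in words for b in words)

def average_length(table):
    """Expected code word length in bits per new-source symbol."""
    return sum(prob * len(word) for word, prob in table)

avg = average_length(TABLE_2)
print(is_prefix_free([w for w, _ in TABLE_2]))
print(f"average length = {avg:.5f} bits per block of three")
print(f"compression    = {3 / avg:.2f} to one")
```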



SECTION III 
SOME TEXT COMPRESSION RESULTS 



Shannon gives us an estimate of the entropy of
English text as a function of how many previous letters are
allowed to be known. An upper bound on compression can be
calculated by dividing this entropy into the entropy of a source
which puts out all letters randomly with equal probability.
Table 3 gives entropies and compressions.

Table 3

Entropies and Compressions of an English Text Source Under
Various Constraints

Constraint                         Entropy (bits/letter)    Compression

None, 26 letters and one           4.76                     1
  space equiprobable
Letter and space frequencies       4.03                     1.18
One letter known                   3.32                     1.43
Two letters known                  3.1                      1.53
Word frequencies used              2.14                     2.22



Shannon continued his investigation of English entropy
beyond the point where "N-grams" of English were known. An
N-gram is a histogram giving the relative frequencies of
combinations of N letters. By having people predict the next letter
when shown the previous L letters Shannon was able to estimate
entropies of English for constraint lengths close to 100 letters.
For 10 ≤ L ≤ 15 the entropy was about 1.5 bits/letter (compression
= 3.17) and for L = 100 it was .95 bits/letter (compression = 5).
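
The compression column of Table 3 follows from dividing the equiprobable entropy by the constrained entropy. A minimal sketch, assuming the baseline is exactly log2 of 27 (26 letters plus the space); the dictionary keys and function name are illustrative:

```python
import math

BASELINE = math.log2(27)  # 26 letters and one space, equiprobable

ENTROPIES = {  # bits/letter, as quoted in the text
    "letter and space frequencies": 4.03,
    "one letter known": 3.32,
    "two letters known": 3.1,
    "word frequencies used": 2.14,
    "10 to 15 letters known": 1.5,
    "100 letters known": 0.95,
}

def compression_bound(entropy, baseline=BASELINE):
    """Upper bound on compression relative to the equiprobable source."""
    return baseline / entropy

for constraint, h in ENTROPIES.items():
    print(f"{constraint:30s} {h:5.2f}  {compression_bound(h):.2f}")
```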

Unfortunately, compressors using constraint lengths of
100 (about 20 words) appear completely beyond the
state-of-the-art. However, single word dictionary type compressors do
appear feasible. A simulated word dictionary compression
algorithm is discussed by White, showing results of compressions
between 1.4 and 1.7 to one with a "small" dictionary and two
to one with a 1000-word dictionary. For a restricted vocabulary
situation, as elementary training and drill CAI may produce, we
probably can take two to one as a working value for statistical
word text compression. This figure compares favorably with
Shannon's figure of 2.22 for word frequency compression.
Consider now the algorithm which is the object of this
report. Snyderman and Hunt report on a practical text
compression algorithm, used at the Science Information
Exchange, Smithsonian Institution, to compress the text portion
of a 200,000 record on-line file from an average of 851 to
553 characters per record. This represents an implemented
compression of 1.54 relative to eight bits/character, a very
respectable figure.

The average of 8/1.54 = 5.2 bits/character represents
a net compression of 1.245 to one relative to 6.46 bits/
character for 88 equal frequency characters. This net
compression lies between Shannon's theoretical compression (1.18)
for an English text source when letter-space frequencies are
known and the compression (1.43) when one previous letter is
known. In summary, the literature indicates character text
compression at around 1.5 to one and word text compression at
around 2 to one.
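
The arithmetic behind the Snyderman-Hunt figures can be retraced as follows. This is a sketch using unrounded intermediate values, so the net compression comes out near 1.24 rather than exactly the 1.245 quoted; the variable names are illustrative.

```python
import math

chars_before, chars_after = 851, 553     # average characters per record
implemented = chars_before / chars_after  # compression relative to 8 bits

bits_per_char = 8 / implemented           # effective bits per character
equiprobable = math.log2(88)              # 88 equal frequency characters
net = equiprobable / bits_per_char        # net compression

print(f"implemented compression = {implemented:.2f}")
print(f"bits/character          = {bits_per_char:.2f}")
print(f"log2(88)                = {equiprobable:.2f}")
print(f"net compression         = {net:.3f}")
```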



SECTION IV 
THE SNYDERMAN-HUNT COMPRESSION ALGORITHM 



This section discusses more formally the Snyderman-Hunt
algorithm. The algorithm was chosen to evaluate the
compressibility of CAI material because of its practicality, its
demonstrated performance on English text and its speed. The speed
of this algorithm on a 360/40 is on the order of 65-75
milliseconds per thousand characters, compressing or decompressing.
It operates on the following principles.

Characters are normally stored one per 8-bit byte. With
eight bits, one of $2^8 = 256$ characters can be specified by each
byte. At the Science Information Exchange only 88 characters
are used: 52 upper and lower case alphabetics, 10 numerics and
26 special characters such as comma, period, dollar sign, etc.
This leaves 256 - 88 = 168 "unused" characters. These otherwise
unused 8-bit combinations can be utilized to represent the more
commonly occurring pairs of characters in the 88 used character
set, thus effecting a compression.



More specifically, it is convenient to define four sets:

    T  = {all 256 possible 8-bit characters}
    C  = {characters actually used}
    CC = {combining characters}                           (3)
    MC = {master characters}

These sets are related as follows:

    $MC \subset CC \subset C \subset T$                   (4)

A further set CP, for "combined pairs", can be formed
of all ordered pairs of MC and CC, i.e.

    $CP = MC \times CC$                                   (5)

The members of CP can be placed in one-to-one correspondence
with the difference set D defined as

    $D = T - C$                                           (6)

The set of noncombining characters NC is given as

    $NC = C - CC$                                         (7)

For example, Snyderman and Hunt choose:

    MC = {space, A, E, I, O, N, T, U}                     (8)

    CC = {space, A through I, L through P, R through W}   (9)

The set MC has 8 members; CC has 21. The set of all combined
pairs CP has 8 x 21 = 168 members which are one-to-one related
to the 168 members of difference set D.

The algorithm works by examining a character in a
string. If the character is a member of MC the next character
is examined. If the next character is a member of CC then the
two-character combined pair is coded into a single unused
character and stored. If the first character is not a member
of MC, it is stored as is. If the first character is a member
of MC but the second one is not a member of CC then the two
characters are stored individually, as is. Thus we see that
compression is dependent upon both the probability of finding
a master character and the conditional probability of finding
a combining character given the finding of a master character.
An analysis of the algorithm is presented in Appendix A.
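
The steps above can be sketched as follows. This is an illustration rather than the Snyderman-Hunt implementation: the master and combining sets are assumed to be the 8 and 21 upper-case characters described in the text, a combined pair is represented as a two-element tuple instead of an actual unused byte value, and after a master character is followed by a noncombining character, scanning is assumed to resume after the second character (the reading used in Appendix A). All function names are hypothetical.

```python
MASTER = " AEIONTU"                    # assumed 8 master characters
COMBINING = " ABCDEFGHILMNOPRSTUVW"    # assumed 21 combining characters

def compress(text):
    """Return a list of output tokens: a str is a character stored as is,
    a tuple is a (master, combining) pair stored as one character."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch in MASTER and i + 1 < len(text):
            nxt = text[i + 1]
            if nxt in COMBINING:
                out.append((ch, nxt))   # pair coded into one character
            else:
                out.extend([ch, nxt])   # both stored individually
            i += 2
        else:
            out.append(ch)              # not a master (or last character)
            i += 1
    return out

def decompress(tokens):
    return "".join(t if isinstance(t, str) else t[0] + t[1] for t in tokens)

def compression_ratio(text):
    """Characters input divided by characters output."""
    return len(text) / len(compress(text)) if text else 1.0

sample = "THE CAT SAT ON THE MAT"
print(len(sample), len(compress(sample)))  # 22 characters in, 12 out
```

Counting tokens rather than emitting bytes mirrors the "score keeping" style of simulation; a real encoder would map each of the 168 pairs to one of the 168 unused byte values.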
SECTION V

EXPERIMENTS

A. Experiment One

1. Description

A computer program was written to simulate the
Snyderman-Hunt algorithm. The simulation did not actually code the
characters, but rather "kept score" on the number of characters
that the algorithm would output for each line of input text.
Compression ratio is the number of characters input divided by
the number of characters output. The program, called TXTCMP,
is interactive, being implemented in CPS (a subset of PL/I)
for operation from a TTY or IBM 2741 terminal. TXTCMP is fed
a line of text at a time and returns both line compression and
total compression since the start of the program. The program
listing and flow chart are reproduced in Appendix B.

The experimental material was chosen from two different
types of CAI data bases: frame-structured and
information-structured. The former was taken from the Computer Operator's
course of reference 1, the latter from reference 6. Both are
reproduced in Appendix C. The lines were entered exactly as
shown in Appendix C, spaces included, from the leftmost
character position as a reference, and the compressions were
obtained. In this experiment the sets chosen by Snyderman and
Hunt for master characters and for combining characters were
used. The set of noncombining characters in this experiment
was everything else on the IBM 2741 keyboard recognized by
CPS.

In the Snyderman-Hunt application, 88 characters were
valid, leaving 168 for encoding character pairs. The
Snyderman-Hunt algorithm can be applied to compressing text in CPS
because CPS also uses, or admits in character strings, 88
characters, leaving 168 for encoding character pairs. These
results also apply to compressing text in the CODIT (Computer
Directed Training) system because it is written into the Air
Force Phase II Base Level System via the Burroughs B3500 COBOL
language. COBOL uses 53 < 88 characters, leaving 203 > 168
characters for encoding character pairs. Indeed, compression
might be slightly better when implemented in the B3500
environment, because the 203 unused characters will accommodate 25
combining characters as opposed to only 21 in reference 5.
Alternatively, 9 rather than 8 master characters could be
accommodated, because the product of 9 master and 21 combining
characters is less than the 203 characters available.
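
The character-set bookkeeping in this paragraph reduces to a simple inequality: the number of master characters times the number of combining characters must not exceed the number of unused 8-bit codes. A small sketch, with illustrative function names:

```python
TOTAL_CODES = 256  # distinct values of one 8-bit byte

def spare_codes(used):
    """Number of 8-bit codes left over for encoding character pairs."""
    return TOTAL_CODES - used

def fits(masters, combiners, used):
    """True if every (master, combining) pair can get its own spare code."""
    return masters * combiners <= spare_codes(used)

print(spare_codes(88), fits(8, 21, 88))   # 88-character set
print(spare_codes(53), fits(8, 25, 53))   # COBOL set, 25 combining characters
print(fits(9, 21, 53))                    # COBOL set, 9 master characters
```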
2. Results

For the frame-structured material the average compression
was 1.473, with individual lines (except those with a single
space) ranging between 1.148 and 1.700. For the
information-structured system material the average compression was 1.538
with a low of 1.261 and a high of 1.875 for individual lines.
There is no particular accounting for the slight (4.4%)
difference in average compression, because the spread in
individual line compression is quite large in both cases
with considerable overlap. From Figure 1 it is seen that the
average compression settles statistically within a few lines.
B. Experiment Two

1. Description

The objective of experiment two is to obtain an estimate
of compression for the Snyderman-Hunt algorithm when applied to
the actual lesson file structure of CODIT. It is found that
the lesson file of CODIT contains both file structure
specification bytes, which are not compressible, and lesson text bytes,
which are. The file structure bytes occur according to Table 4.

File Structure Bytes

Application       Number of Bytes

Frame Number      4 (per frame)
Frame Type        2 (per frame)
Frame Length      2 (per frame)
Group Number      1 (per group)
Group Length      2 (per group)
Line Number       4 (per line)
Line Length       3 (per line)

Table 4





The program TXTCMP was modified (TXCP2) to add "overhead"
bytes to the compression calculation in the amount of 14 + 3 x
number of groups + 7 x number of lines each time a new frame
of CAI material was encountered. As an example, the CODIT
printout shown in Figure 1 of Appendix C contains three frames with
frame two containing three groups and six (numbered) lines.
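
The overhead rule can be written out as a short sketch. The per-frame constant of 14 is taken as printed in the text and left as a parameter, since the per-frame rows of the file structure table sum to 8. How TXCP2 combined overhead with text bytes is not spelled out; the second function assumes the incompressible overhead is added to both the input and output character counts. Both function names are hypothetical.

```python
def frame_overhead(groups, lines, per_frame=14, per_group=3, per_line=7):
    """Overhead bytes charged for one frame: a fixed per-frame amount plus
    3 bytes per group and 7 bytes per line."""
    return per_frame + per_group * groups + per_line * lines

def compression_with_overhead(text_in, text_out, overhead):
    """Assumed definition: overhead bytes are incompressible, so they
    count on both sides of the input/output ratio."""
    return (text_in + overhead) / (text_out + overhead)

# Frame two of the example: three groups and six numbered lines.
print(frame_overhead(groups=3, lines=6))
```

Under this assumption any overhead pulls the ratio toward one, which is consistent with the lower subsystem figure reported for CODIT.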

When the CODIT CAI material was entered, only the
numbered lines were entered for the compression calculation.
It will be recalled that in experiment one all lines as shown
in the figure were entered. The line numbers and the two spaces
beyond were not entered; only the text (course author generated)
to the right of this point is used. This is because all other
(formatting) characters can be accounted for by the CODIT
master program reading the "overhead" bytes and producing
therefrom the non-text characters in the printout.

2. Results 

The total CODIT subsystem compression for the material
in Figures 1 and 2 of Appendix C is 1.318. While this
compression is less than that obtained using all the characters
in Figures 1 and 2, it is a more realistic value because the
CODIT file structure "overhead" bytes are taken into account.
Also, it is a conservative (low) value because the frames in
the experimental set have very little expository text material.
The frames are largely for questioning the trainee rather than
for instructing him. One can reasonably expect an experimental
set containing a mix of questioning frames and instructing
frames to yield a higher compression. Even so, the 1.318 figure
has useful implications. In the CODIT subsystem it means
reducing each 121,600 byte lesson file by about 28,000 bytes
or, alternatively, putting 30% more lessons on disk for the
same CAI file allocation in the Air Force Phase II Base
Level System. Putting more lessons on-line gives increased
daily flexibility to the OJT/CAI program. Using less disk for
CAI increases the chances for its acceptance since it leaves
adequate disk space for the other functional areas, such as
personnel, finance and civil engineering.
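
The disk-space figures follow directly from the compression ratio. In the sketch below (function names are illustrative) the quoted saving of about 28,000 bytes and 30% corresponds to the rounded ratio of 1.3; the measured 1.318 gives slightly larger values.

```python
LESSON_BYTES = 121_600  # stated size of one 300 frame CODIT lesson

def bytes_saved(size, ratio):
    """Bytes freed when a file of the given size is compressed by ratio."""
    return size - size / ratio

def extra_capacity(ratio):
    """Fractional increase in lessons that fit in the same allocation."""
    return ratio - 1

for ratio in (1.3, 1.318):
    print(f"ratio {ratio}: save {bytes_saved(LESSON_BYTES, ratio):,.0f} bytes, "
          f"{extra_capacity(ratio):.1%} more lessons")
```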

C. Experiment Three

1. Description

The objective of experiment three is to verify the
analytical model of the Snyderman-Hunt algorithm developed in Appendix A.
The essence of the model is equation (7), Appendix A, which
predicts compression on the basis of $p_1$, the probability of a master
character occurring, and $p_1 p_2$, the joint probability of both a
master and a combining character occurring together. Should
the model be verified to an engineering degree of accuracy, it
would then be possible to select more easily optimum master
and combining character sets because $p_1$ is simply related to
single letter and space relative occurrences in English and
$p_1 p_2$ is also simply related to double letter and space relative
occurrences. When TXTCMP was developed into TXCP2, provision
was made to measure $p_1$, $p_2$ and $p_1 p_2$ in the text portion of
the experimental material. A theoretical, or predicted, text
compression was calculated. The experimental material used
was the text portion (1003 characters) of Figures 1 and 2 of
Appendix C.

2. Results

Using the text material only, i.e., no CODIT subsystem
"overhead" bytes considered, it is found that the material of
Figures 1 and 2, Appendix C, yields $p_1$ = .566, $p_1 p_2$ = .531 and
a theoretical compression of 1.513. This value compares quite
well (within 2%) with the experimentally measured text compression,
1.530. Furthermore, examination of cumulative measured text
compression and cumulative theoretical text compression as
it builds up on a line-by-line basis shows that the compression
predicted by equation (7) of Appendix A is stable and always
within 2.5 percent, thus indicating a valid model for the
Snyderman-Hunt algorithm.
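
The predicted figure can be reproduced from the measured probabilities. Equation (7) of Appendix A is not reproduced in this part of the report; the sketch assumes it has the form obtained by dividing the expected characters read in per cycle, 1 + p1, by the expected characters read out per cycle, 1 + p1 - p1 p2, as the Appendix A derivation suggests. The function name is hypothetical.

```python
def predicted_compression(p1, p1p2):
    """Assumed form of the model: expected characters in per read cycle
    (1 + p1) over expected characters out per cycle (1 + p1 - p1*p2)."""
    return (1 + p1) / (1 + p1 - p1p2)

theory = predicted_compression(p1=0.566, p1p2=0.531)
measured = 1.530
print(f"theoretical compression = {theory:.3f}")
print(f"relative difference     = {(measured - theory) / measured:.1%}")
```

With no combining pairs at all (p1 p2 = 0) the expression reduces to one, i.e. no compression, as expected.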



SECTION VI

CONCLUSIONS, QUESTIONS AND RECOMMENDATIONS

Based on these results, three major conclusions follow:

1. A working figure of 1.5 may be taken for the practical
compression of CAI text material.

2. When frame formatting overhead bytes are taken into
account in a typical CAI system, the compression figure becomes,
conservatively, 1.3 to one.

3. It is possible to adequately model the Snyderman-Hunt
algorithm and predict compression performance within a few
percent, based on text statistics.

Given these conclusions several timely questions may be
raised:

How can the Snyderman-Hunt algorithm be optimally applied
to CODIT, which is now being implemented Air Force-wide? Where
would the compression and decompression algorithm be inserted
into the CODIT system flow diagram (pg. 50 of reference 1)?
Can you patch a B3500 assembly language compression-decompression
algorithm into a compiled COBOL CODIT program? Given that COBOL
uses only 53 characters, what are now the optimum master and
combining character sets? What is the dollar saving in reduced
disk files and magnetic tapes? By how much is this dollar saving
offset by the 75-odd microsecond per character CPU time cost?
The dollar saving questions can be approached in two ways:

1. By taking gross costs from the current B3500 Base Level
System installation with estimates of CAI file space, CAI
character throughput, and B3500 speed for compressing and
decompressing, it is possible, in-house, to arrive at a rough
estimate of dollar saving.

2. By putting this problem to industry as a contracted study
wherein the contractor designs an optimal compression system
based on extensive CAI data base material, does a preliminary
system design around current or projected hardware, and
calculates relative costs of going compressed and uncompressed
within the system.

It is recommended that (1) above be accomplished and,
based on the outcome, (2) be considered, perhaps as part of
contract definition for Air Force systems beyond B3500. It
is also recommended that text compression be considered if
CODIT is rewritten in JOVIAL for DAFCCS application. Finally,
it is recommended that the Snyderman-Hunt algorithm be
experimentally applied to other Air Force textual data bases,
such as intelligence.
APPENDIX A 



Analysis of the Snyderman-Hunt 
Text Compression Algorithm 



Consider a string of N characters. As a character is 
examined to see if it is a master character, there is the 
possibility that either one or two characters will be read in. 
Let $p_1$ be the probability that the character examined is a 
master character and $1 - p_1$ the probability it is not. If the 
character is a master character, then a second character will 
be read in; if it is not, then only the single character is 
read in, and the cycle is repeated. The expected number of 
characters input, per cycle, is given by 



$$\mathrm{ECI} = 2\,(p_1) + 1\,(1 - p_1) = 1 + p_1 \tag{1}$$



For a string of N characters the number of read cycles R is 
given by 

$$R = \frac{N}{1 + p_1} \tag{2}$$



When a master character is found, with probability $p_1$, 
two possibilities exist: the next character will be a combining 
character, or it will not. Let $p_2$ be the probability that the 
next character will be a combining character and $1 - p_2$ that 
it will not. If the second character is a combining character, 
it will be combined with the master character and only one 
character will be read out. If the second character is not a 
combining character, then two characters will be read out. If 
the first character is not a master character, only one character 
will be read out. These rules lead to the expected number of 
characters output per cycle being given by 

$$\mathrm{ECO} = 1\,(p_1 p_2) + 2\,\bigl(p_1 (1 - p_2)\bigr) + 1\,(1 - p_1) = 1 + p_1 - p_1 p_2 \tag{3}$$

The expected number of characters read out NO, per line 
of N characters in, is given by 

$$\mathrm{NO} = R\,(\mathrm{ECO}) \tag{4}$$

$$\mathrm{NO} = R\,(1 + p_1 - p_1 p_2) \tag{5}$$
Compression C is defined as the number of characters 
N in the line divided by the number of characters NO read out 
from the line processing, i.e. 

$$C = \frac{N}{\mathrm{NO}} \tag{6}$$

Substituting previous work in the above, we relate expected 
compression to the probabilities $p_1$ and $p_2$. 

$$C = \frac{R\,(1 + p_1)}{R\,(1 + p_1 - p_1 p_2)} = \frac{1 + p_1}{1 + p_1 - p_1 p_2} \tag{7}$$
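
As a numerical illustration of equation (7), the short Python sketch below evaluates the expected compression for a given pair of probabilities. Python is used here for convenience only; it is not the language of the report's simulation. The function name and the sample probabilities are illustrative assumptions, not measured values.

```python
def expected_compression(p1: float, p2: float) -> float:
    """Equation (7): C = (1 + p1) / (1 + p1 - p1*p2).

    p1 is the probability that the character opening a read cycle is a
    master character; p2 is the probability that the following character
    is a combining character.
    """
    if not (0.0 <= p1 <= 1.0 and 0.0 <= p2 <= 1.0):
        raise ValueError("p1 and p2 must be probabilities in [0, 1]")
    # The denominator equals 1 + p1*(1 - p2), so it is never below 1.
    return (1.0 + p1) / (1.0 + p1 - p1 * p2)


# Hypothetical sample values, chosen only to show the arithmetic.
print(expected_compression(0.5, 0.5))
```

With both probabilities at one half, the numerator is 1.5 and the denominator 1.25, so the expected compression is 1.2.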

Note that if all first read characters are master 
characters, $p_1 = 1$, and if all second read characters are 
combining characters, $p_2 = 1$, then C is a maximum and equal to 2. 
On the other hand, if no master characters occur, $p_1 = 0$, then 
compression is at a minimum and equal to unity. Since $p_1$ is 
the probability of finding a master character, p(MC), and $p_2$ is 
the probability p(CC|MC) of finding a combining character, 
given a master character, we see that $p_1 p_2$ is the joint 
probability p(MC,CC) of finding a master character and a combining 
character together. Both p(MC) and p(MC,CC) can be experimentally 
determined for a given data base, such as English, once a table 
of first and second order occurrences is compiled and the sets 
of master characters and combining characters are defined. The 
sets can be adjusted, within the constraints given in the text, 
to maximize the expected compression. 
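
One way to obtain p(MC), p(CC|MC), and p(MC,CC) from a sample of text is to count them per read cycle, as the analysis above defines them. The Python sketch below is a minimal illustration under stated assumptions: the master and combining sets are assumed to be the Snyderman-Hunt sets (blank plus seven letters, and blank plus twenty letters), the function name is hypothetical, and a master character that ends the text is counted as an ordinary single-character cycle, since nothing follows it.

```python
# Assumed character sets (blank included in both); adjust as needed.
MASTER = set(" AEINOTU")
COMBINING = set(" ABCDEFGHILMNOPRSTUVW")


def cycle_statistics(text, master=MASTER, combining=COMBINING):
    """Return (p_mc, p_cc_given_mc, p_joint) measured per read cycle."""
    cycles = masters = pairs = 0
    i = 0
    while i < len(text):
        cycles += 1
        # A master character only opens a two-character cycle if a
        # second character exists to be read in.
        if text[i] in master and i + 1 < len(text):
            masters += 1
            if text[i + 1] in combining:
                pairs += 1
            i += 2  # both characters are consumed in this cycle
        else:
            i += 1
    p_mc = masters / cycles if cycles else 0.0
    p_cc_given_mc = pairs / masters if masters else 0.0
    return p_mc, p_cc_given_mc, p_mc * p_cc_given_mc


print(cycle_statistics("THE CAT"))
```

For the seven-character sample "THE CAT" there are four read cycles (TH, E-blank, C, AT), three of which open with a master character, and all three are followed by a combining character. The measured values can then be substituted into equation (7).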
APPENDIX B 

TXTCMP Program Listing 

Note 1: The program operates by working its way (via POINT) 
through the LINE of text, character by character. If a master 
character is not found, both the compressed and uncompressed 
bit counts are augmented by one byte; if a master character is 
found, the next character is tested for being a combining 
character. If the next character is a combining character, the 
compressed bit count is augmented by one byte and the 
uncompressed bit count by two bytes; otherwise both compressed and 
uncompressed bit counts are augmented by two bytes. An isolated 
master character at the end of LINE will be so identified 
(program line 350) and cause augmentation of both compressed and 
uncompressed bit counts by one byte. Success of the end of line 
test initiates printout. 
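
The counting rules of Note 1 can be sketched in a few lines of Python. This is an illustrative re-expression, not the CPS program itself: the function names are hypothetical, the character sets are assumed to be the Snyderman-Hunt master and combining sets, and counts are kept in bits (8 per byte) as in TXTCMP.

```python
# Assumed character sets (blank included in both).
MASTER = " AEINOTU"
COMBINING = " ABCDEFGHILMNOPRSTUVW"


def line_bit_counts(line):
    """Return (uncompressed_bits, compressed_bits) for one line of text."""
    uc = c = 0
    point = 0
    while point < len(line):
        uc += 8
        # An isolated master character at the end of the line is
        # treated like any other single character.
        if line[point] in MASTER and point + 1 < len(line):
            point += 1
            uc += 8
            if line[point] in COMBINING:
                c += 8    # pair packed into one byte
            else:
                c += 16   # both characters written out unchanged
        else:
            c += 8
        point += 1
    return uc, c


def total_compression(lines):
    """Ratio of uncompressed to compressed bits over all lines."""
    tuc = tc = 0
    for line in lines:
        uc, c = line_bit_counts(line)
        tuc += uc
        tc += c
    if tc == 0:
        raise ValueError("no characters to compress")
    return tuc / tc


print(line_bit_counts("THE CAT"))
print(total_compression(["THE CAT", "AB"]))
```

For "THE CAT" the pairs TH, E-blank, and AT each pack into one byte and C stands alone, giving 56 uncompressed bits against 32 compressed bits.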

Note 2: Program line 426 is not essential to operation; 
it merely prints the value of POINT occasionally to let you 
know the program is functioning during the wait between line 
input and compression printout. 

Note 3: Variable listing 

Variable    Explanation 

M(I)        Master character array 
CC(I)       Combining character array 
TUC         Number of bits, uncompressed, from beginning of program 
TC          Number of bits, compressed, from beginning of program 
LINE        Character variable containing a line of text 
UC          Number of bits, uncompressed, in a given line 
C           Number of bits, compressed, in a given line 
POINT       A text pointer variable 
TEST1       A character variable containing one character being 
            tested to see if it is a master character 
I           A general indexing variable 
TEST2       A character variable containing one character being 
            tested to see if it is a combining character 
TOTCMP      Total compression since beginning of program 
LNECMP      Compression of the given line above 

Note 4: Label listing 

Label       Explanation 

TXTCMP      The name of the program: "Text Compression" 
LNEGET      Get a new line of text 
CHRGET      Get a new character from the line 
NXTCHR      Get the next character (following an identified 
            master character) 
AUGMT2      Augment the bit count by 2 bytes (16 bits) 
AUGMT1      Augment the bit count by 1 byte (8 bits) 
EOLTST      End of line test 
EOL         End of line 



[Figure: FLOW CHART FOR TXTCMP. Boxes, in order: EXECUTE; enter a 
line of text; get character from buffer variable "LINE"; is 
character a master character? (yes: NXTCHR, get next character 
from buffer variable "LINE"); is next character a combining 
character?; augment C and TC by 16 bits; augment C and TC by 
8 bits; print value of POINT as cue to operator that program is 
working; compute total compression and line compression; output 
total compression and line compression. Annotation: "This program 
simulates the text compression program of Snyderman and Hunt, 
Datamation, December 1, 1970."] 



10.  /* THIS PROGRAM SIMULATES THE TEXT COMPACTION ALGORITHM 
15.  OF SNYDERMAN AND HUNT, DATAMATION, DEC 1, 1970.*/; 
20.  PUT LIST(' '); 
25.  PUT LIST('EXECUTING TEXT COMPRESSION.'); 
30.  PUT LIST('PLEASE NOTE: ENTER ALL CHARACTERS IN UPPERCASE.'); 
35.  PUT LIST('ALSO NOTE: LIMIT LINE TO 70 CHARACTERS.'); 
40.  PUT LIST('HIT ATTN ON IBM 2741 TERMINAL OR BREAK ON TTY TO END PROGRAM.'); 
45.  PUT LIST(' '); 
50.  DECLARE M(8) CHAR(1), LINE CHAR(70) VAR; 
55.  DECLARE CC(21) CHAR(1); 
60.  DECLARE TEST1 CHAR(1), TEST2 CHAR(1); 
65.  M(1)=' '; 
70.  M(2)='A'; 
75.  M(3)='E'; 
80.  M(4)='I'; 
85.  M(5)='N'; 
90.  M(6)='O'; 
95.  M(7)='T'; 
100. M(8)='U'; 
105. CC(1)=' '; 
110. CC(2)='A'; 
115. CC(3)='B'; 
120. CC(4)='C'; 
125. CC(5)='D'; 
130. CC(6)='E'; 
135. CC(7)='F'; 
140. CC(8)='G'; 
145. CC(9)='H'; 
150. CC(10)='I'; 
155. CC(11)='L'; 
160. CC(12)='M'; 
165. CC(13)='N'; 
170. CC(14)='O'; 
175. CC(15)='P'; 
180. CC(16)='R'; 
185. CC(17)='S'; 
190. CC(18)='T'; 
195. CC(19)='U'; 
200. CC(20)='V'; 
205. CC(21)='W'; 
240. TUC=0; 
285. TC=0; 
290. LNEGET: PUT LIST('LINE'); 
295. READ INTO(LINE); 
300. UC=0; 
305. C=0; 
310. POINT=1; 
315. CHRGET: TEST1=SUBSTR(LINE,POINT,1); 
320. TUC=TUC+8; 
325. UC=UC+8; 
335. DO I=1 TO 8; 
340. IF TEST1=M(I) THEN GO TO NXTCHR; 
345. END; 
346. GO TO AUGMT1; 
350. NXTCHR: IF POINT=LENGTH(LINE) THEN GO TO AUGMT1; 
355. POINT=POINT+1; 
360. TEST2=SUBSTR(LINE,POINT,1); 
365. TUC=TUC+8; 
370. UC=UC+8; 
380. DO I=1 TO 21; 
385. IF TEST2=CC(I) THEN GO TO AUGMT1; 
390. END; 
395. AUGMT2: C=C+16; 
400. TC=TC+16; 
405. GO TO EOLTST; 
410. AUGMT1: C=C+8; 
415. TC=TC+8; 
420. EOLTST: IF POINT=LENGTH(LINE) THEN GO TO EOL; 
425. POINT=POINT+1; 
426. IF POINT/6=TRUNC(POINT/6) THEN PUT LIST(POINT); 
430. GO TO CHRGET; 
435. EOL: TOTCMP=TUC/TC; 
440. LNECMP=UC/C; 
445. PUT LIST(' '); 
450. PUT LIST('LINE COMPRESSION'); 
455. PUT LIST(LNECMP); 
460. PUT LIST('TOTAL COMPRESSION'); 
465. PUT LIST(TOTCMP); 
470. PUT LIST(' '); 
475. GO TO LNEGET; 
480. END TXTCMP; 



FIGURE B-2 CPS LISTING OF PROGRAM 
TXTCMP: "TEXT COMPRESSION" 
APPENDIX C 



EXPERIMENTAL MATERIAL 



Reproduction of frame-structured and information-structured 
CAI material. All parts of all lines containing one or more 
characters constitute the experimental set for experiment one. 
Only the text portions of numbered lines in Figures C-1 and C-2 
constitute the experimental set for experiments 2 and 3. 
LESSON OOlWOOO DATE WRITTEN 160569 PAGE 1 

FRAME 1.0 TYPE M1 LABEL 000700 
G.2 TEXT 

1.0 VIT PROGRAMMING LANGUAGES?? 

2.0 DO YOU WANT TO TRY THE LESSON ON PROGRAMMING LANGUAGES 
3.0 OR DO YOU THINK YOU CAN SKIP IT? 

G.3 ANSWERS 

1.0 A+I WILL TRY THE LESSON ON PROGRAMMING LANGUAGES. 
2.0 B+I THINK I KNOW ENOUGH TO SKIP IT. 

G.4 ACTIONS 

1.0 A F:FINE. LET'S BEGIN. 8:31 

2.0 B F:WE'LL GIVE YOU A LITTLE TEST JUST TO MAKE SURE. 
FRAME 2.0 TYPE Q1 LABEL 
G.2 TEXT 

1.0 WHAT DOES COBOL STAND FOR? 

G.3 ANSWERS 

1.0 0 SET KEYWORD ON 
2.0 0 SET PHONETIC ON 
3.0 0 SET ORDER ON 

4.0 A+ COMMON BUSINESS ORIENTED LANGUAGE 
FRAME 3.0 TYPE Q1 LABEL 
G.2 TEXT 

1.0 WHAT DOES FORTRAN STAND FOR? 
G.3 ANSWERS 

1.0 A+ FORMULA TRANSLATION 

FIGURE C-1 CODIT SUBSYSTEM FRAME 
STRUCTURED CAI MATERIAL 



LESSON 000700 DATE WRITTEN 160569 PAGE 2 
FRAME 4.0 TYPE Q1 LABEL 
G.2 TEXT 

1.0 WHAT DOES RPG STAND FOR? 

G.3 ANSWERS 

1.0 A+REPORT PROGRAM GENERATOR 
FRAME 5.0 TYPE M1 LABEL 
G.2 TEXT 

1.0 WHAT IS MADE UP OF 1'S AND 0'S? 

G.3 ANSWERS 

1.0 A+MACHINE LANGUAGE 

2.0 B PROCEDURE-ORIENTED LANGUAGE 

3.0 C RPG LANGUAGE 

4.0 D OCTAL LANGUAGE 

5.0 E NONE OF THE ABOVE 

FRAME 6.0 TYPE Q1 LABEL 

G.2 TEXT 

1.0 WHAT DO YOU CALL MACHINE-SPECIFIC INSTRUCTIONS USED BY A 
2.0 PROGRAMMER SPECIALIST TO REPRESENT EACH MACHINE OPERATION? 
3.0 (THE WORD 'MACHINE' SHOULD NOT BE INCLUDED) 

G.3 ANSWERS 

1.0 A+MNEMONIC 

2.0 B+SYMBOLIC 

3.0 C+SYMBOLIC CODE 

FRAME 7.0 TYPE D1 LABEL 

G.2 CONDITIONS 

1.0 IF GQ 2 WRONG 2-6 F: YOU'RE OFF TO A BAD START. YOU'D BETTER 
2.0 F:TRY THE LESSON. B:M0D7 

FIGURE C-2 CODIT SUBSYSTEM FRAME 
STRUCTURED CAI MATERIAL 
(RPAQQ LATITUDE (((CN LATITUDE) 
(DET THE DEF 2)) 
NIL 

(SUPERC NIL (DISTANCE NIL ANGULAR (FROM NIL 

EQUATOR))) 
(SUPERP (I 2) 

LOCATION) 
(VALUE (I 2) 

(RANGE NIL -90 90)) 
(UNIT (I 2) 

DEGREES))) 



(RPAQQ ARGENTINA (((XN ARGENTINA) 
(DET NIL DEF 2)) 
NIL 

(SUPERC NIL COUNTRY) 
(SUPERP (I 6) 

SOUTH/AMERICA) 
(AREA (I 2) 

(APPROX NIL/120000000 
(LOCATION NIL SOUTH/AMERICA (LATITUDE (I 2) 
(RANGE NIL -22 -55)) 
(LONGITUDE (I 4) 

(RANGE NIL -57 -71)) 
(BORDERING/COUNTRIES (I 1) 
(NORTHERN (I 1) 

BOLIVIA PARAGUAY) 
(EASTERN (I 1) 

(($L BRAZIL URUGUAY 
NIL 

(BOUNDARY NIL URUGUAY/RIVER))) 

(CAPITAL (I 1) 

BUENOS/AIRES) 
(CITIES (I 3) 

(PRINCIPAL NIL ($L BUENOS/AIRES CORDOBA ROSARIO 

MENDOZA LA/PLATA TUCUMAN))) 

(TOPOGRAPHY (I 1) 

VARIED 

(MOUNTAIN/CHAINS NIL (PRINCIPAL NIL ANDES 

(LOCATION NIL (BOUNDARY NIL (WITH NIL 
CHILE))) 
(ALTITUDE NIL (HIGHEST NIL ACONCAGUA 
(APPROX NIL 22000)))) 
(SIERRAS NIL (LOCATION NIL ($L CORDOBA 

BUENOS/AIRES)))) 
(PLAINS NIL (FERTILE NIL USUALLY) 
(($L EASTERN CENTRAL) 

NIL PAMPA) 
(NORTHERN NIL CHACO))) 

FIGURE C-3 THE UNITS FOR LATITUDE AND ARGENTINA (FRAGMENTS) IN SCHOLAR, 
AN INFORMATION STRUCTURED CAI SYSTEM 